Pipelines, nf-core,
and nf-core rnaseq

Jelmer Poelstra

MCIC Wooster, OSU

2024-03-27

Informal and formal pipelines

What do we mean by a “pipeline” or “workflow”?

[Figure: Snakemake DAG of an example workflow: raw FASTQ files for samples smpA, smpC, and smpG are each trimmed (trimmed FASTQ) and mapped (BAM), then counted into a single count table]

A very basic informal pipeline script

samples=(smpA smpC smpG)

# Trim:
for sample in "${samples[@]}"; do
    scripts/trim.sh data/"$sample".fq > res/"$sample"_trim.fq
done

# Map:
for sample in "${samples[@]}"; do
    scripts/map.sh res/"$sample"_trim.fq > res/"$sample".bam
done

# Count:
scripts/count.sh "${samples[@]}" > res/count_table.txt

Why create a pipeline / workflow?

What are the advantages of creating such a pipeline or workflow, rather than running scripts one-by-one as needed?

  • Rerunning everything is much easier.

  • Re-applying the same set of analyses in a different project is much easier.

  • It helps ensure that you include all necessary steps.

  • The pipeline is a form of documentation of the steps taken.

  • Improving reproducibility, e.g. making it easier for others to repeat your analysis.

Pipeline script challenges

Rerunning part of the pipeline may be necessary, e.g. after:

  • Some scripts fail for all or some samples.

  • Adding a sample.

  • Needing to modify a script or settings somewhere halfway through the pipeline.


Batch job submissions pose problems:

for sample in "${samples[@]}"; do
    sbatch scripts/trim.sh data/"$sample".fq > res/"$sample"_trim.fq
    sbatch scripts/map.sh res/"$sample"_trim.fq > res/"$sample".bam
done

sbatch scripts/count.sh "${samples[@]}" > res/count_table.txt

What is the problem here? Steps that need output files from prior steps won't wait for those prior steps! (Also, the redirections are misleading: each > now captures sbatch's brief submission message, not the tool's output.)

Potential solutions to these challenges

  • Informal pipeline
    Rather than running the pipeline script as an actual script, just use it as scaffolding and submit jobs interactively, one-by-one or group-by-group. You don't have a formal pipeline, but that may be OK.

  • Push the limits of the Bash and Slurm tool set
    Work with if statements, many arguments to scripts, and Slurm “job dependencies” — but this is very hard to manage for more complex workflows.

  • Use a formal workflow management tool.
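To make the second option concrete, here is a sketch of what the Slurm job-dependency approach could look like for the trim/map/count example. It assumes a Slurm cluster, and that the hypothetical scripts/trim.sh, scripts/map.sh, and scripts/count.sh write their own output files (a plain > redirection would only capture sbatch's submission message):

```shell
#!/bin/bash
samples=(smpA smpC smpG)

map_ids=()
for sample in "${samples[@]}"; do
    # Submit the trim job; --parsable makes sbatch print only the job ID
    trim_id=$(sbatch --parsable scripts/trim.sh data/"$sample".fq)
    # Make each map job wait until its trim job has finished successfully
    map_ids+=("$(sbatch --parsable --dependency=afterok:"$trim_id" \
                 scripts/map.sh res/"$sample"_trim.fq)")
done

# Join the map job IDs with ':' (e.g. 101:102:103), then make the
# count job wait for all of them
all_maps=$(IFS=:; echo "${map_ids[*]}")
sbatch --dependency=afterok:"$all_maps" scripts/count.sh "${samples[@]}"
```

This works, but you can see why the slide calls it "very hard to manage": every new branching step means more bookkeeping of job IDs.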

The need for specialized tools

Typically, researchers codify workflows using general scripting languages such as Python or Bash. But plain scripts in these languages often lack the necessary flexibility.

Workflows can involve hundreds to thousands of data files; a pipeline must be able to monitor their progress and exit gracefully if any step fails. And pipelines must be smart enough to work out which tasks need to be re-executed and which do not.

Workflow management systems

Workflow tools, often called “workflow management systems”, provide ways to formally describe and execute workflows/pipelines.


Some of the most commonly used (command-line based) options in bioinformatics are:

  • Nextflow
  • Snakemake
  • Common Workflow Language (CWL)
  • Workflow Description Language (WDL)

Advantages of formal workflow management

Automation

  • Detect & rerun upon changes in input files and failed steps.
  • Easily run for other data sets.
  • Automate Slurm job submissions.

Reproducibility and transparency

  • Document dependencies among data and among code.
  • Integration with software management.

Flexibility, portability, and scalability, by separating the generic pipeline nuts-and-bolts from:

  1. Run-specific configuration — samples, directories, settings/parameters.
  2. Things specific to the run-time environment (laptop vs. cluster vs. cloud).

Writing pipelines vs. using them

Most workflow tools are small languages (domain-specific languages, DSLs), often a sort of extension of a more general language (Python for Snakemake, Groovy for Nextflow).


Learning one of these well enough to write your own workflows is perhaps only worth it if you expect to work on genomics/bioinformatics projects fairly regularly.


But there are also publicly available and well-documented workflows written by others that you can use.

nf-core

nf-core curates Nextflow pipelines

The “nf-core” initiative (https://nf-co.re) curates a set of best-practice, flexible, and well-documented Nextflow workflows.

It also has some of its own tooling built on top of Nextflow to help contributors create more robust and user-friendly workflows.

The nf-core and Nextflow papers


https://www.nature.com/articles/s41587-020-0439-x

The nf-core and Nextflow papers (cont.)


https://www.nature.com/articles/nbt.3820

nf-core pipelines

nf-core currently has 58 complete pipelines — these are the four most popular ones:

[Figure: the four most popular nf-core pipelines]


Today’s lab: nf-core rnaseq

We will run the nf-core rnaseq pipeline (https://nf-co.re/rnaseq), a reference-based RNA-seq pipeline that takes FASTQ and reference genome files as inputs to produce a gene count table and many “QC outputs”.

Notes on running an nf-core pipeline

Running a pipeline like this is a bit different (and more involved) than running a typical piece of bioinformatics software — it is, after all, stitching together many operations. Some considerations:

  • We only need to install Nextflow itself; the workflow files are simply downloaded, not installed.

  • The pipeline runs all its constituent tools via Singularity containers (most common & recommended) or via Conda environments.

  • The pipeline will submit Slurm batch jobs for us. It tries to parallelize as much as possible, so there are many separate jobs.

  • When using the Nextflow -resume option, the workflow will check what needs to be (re)run and what doesn't: this is quite intricate but generally works well.

  • Nextflow makes a distinction between the final output dir and the "working dir". Jobs run and produce their outputs in the latter, and some of those outputs are then copied to the output dir.
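Putting these points together, a typical invocation might look something like the sketch below. The file names are placeholders, and the version pinned with -r is just an example; check the pipeline documentation for the parameters of the version you run:

```shell
# Sketch of a typical nf-core rnaseq run.
# Single-dash options (-r, -profile, -resume) are Nextflow's own;
# double-dash options are pipeline parameters.
nextflow run nf-core/rnaseq \
    -r 3.14.0 \
    -profile singularity \
    --input samplesheet.csv \
    --outdir results \
    --fasta genome.fa \
    --gtf annotation.gtf \
    -resume
```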

A closer look at nf-core rnaseq

RNA-seq to do what?

The nf-core rnaseq pipeline is meant for RNA-seq projects that:

  • Attempt to sequence only mRNA while avoiding non-coding RNAs (“mRNA-seq”).
  • Do not distinguish between RNA from different cell types (“bulk RNA-seq”).
  • Use short reads (≤150 bp) that do not cover full transcripts but do uniquely ID genes.
  • Use reference genomes (are reference-based) to associate reads with genes.
  • Downstream, such projects typically aim to statistically compare expression between groups of samples, and have multiple biological replicates per group.

That might seem quite specific, but this is by far the most common RNA-seq use case.

Two main parts to RNA-seq data analysis

There are two main parts to the kind of RNA-seq data analysis I just described:


  • From reads to counts
    • Generating a count table using the reads & the reference genome.
    • This is what nf-core rnaseq does, and is what we will run in today's lab.

  • Count table analysis
    • Exploratory data & differential expression analysis is covered in the self-study lab.
    • Functional enrichment (GO, KEGG) analysis is typically also performed.

Overview of the pipeline again

The main steps

Read QC and pre-processing

  1. Read QC (FastQC)

  2. Adapter and quality trimming (TrimGalore)

  3. Optional removal of rRNA (SortMeRNA) — off by default, but we will include this

Alignment & quantification

  1. Alignment to the reference genome/transcriptome (STAR)

  2. Gene expression quantification (Salmon)

Post-processing, QC, and reporting

  1. Post-processing alignments: sort, index, mark duplicates

  2. Alignment/count QC (RSeQC, Qualimap, dupRadar, Preseq, DESeq2)

  3. Create a report (MultiQC)

Pipeline variants

This pipeline is quite flexible and you can turn several steps off, add optional steps, and change individual options for most tools that the pipeline runs.

  • Optional removal of contaminants (BBSplit)
    Map to 1 or more additional genomes whose sequences may be present as contamination, and remove reads that map better to contaminant genomes.

  • Alternative quantification routes
    • Use RSEM instead of Salmon to quantify.
    • Skip STAR and perform direct pseudo-alignment & quantification with Salmon.

  • Transcript assembly and quantification (StringTie)
    While the workflow is focused on gene-level quantification, it does produce transcript-level counts as well (this is run by default).
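As a rough sketch, the variants above map onto pipeline parameters along the following lines (parameter names as documented for recent nf-core/rnaseq versions; verify them against the docs for the version you run):

```shell
# rRNA removal with SortMeRNA (off by default):
#   --remove_ribo_rna
# Contaminant removal with BBSplit:
#   --bbsplit_fasta_list contaminants.csv
# Quantify with RSEM instead of STAR + Salmon:
#   --aligner star_rsem
# Skip STAR and pseudo-align with Salmon directly:
#   --skip_alignment --pseudo_aligner salmon
```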

Questions?

Bonus slides

From reads to counts: overview

This part is “bioinformatics-heavy” with large files, a need for lots of computing power, and command-line (Unix shell) programs — it specifically involves:

  1. Read pre-processing

  2. Aligning reads to a reference genome

  3. Quality control (QC) of both the reads and the alignments

  4. Quantifying expression levels

Reads to counts: read pre-processing

Read pre-processing includes the following steps:


  • Checking the quantity and quality of your reads
    • Does not change your data, but helps decide next steps / sample exclusion
    • Also useful to check for contamination, library complexity, and adapter content

  • Removing unwanted sequences
    • Adapters, low-quality bases, and very short reads
    • rRNA-derived reads (optional)
    • Contaminant sequences (optional)

Reads to counts: alignment to a reference genome

The alignment of reads to a reference genome needs to be “splice-aware”.


Berge et al. 2019

Reads to counts: alignment to a reference genome

Alternatively, you can align to the transcriptome (i.e., all mature transcripts):

Berge et al. 2019

Reads to counts: alignment QC

Here are some examples of the kinds of things to pay attention to:

  • Alignment rates
    What percentage of reads was successfully aligned? (Should be >80%)

  • Alignment targets
    What percentages of aligned reads mapped to exons vs. introns vs. intergenic regions?
    • What might cause high intronic mapping rates? An abundance of pre-mRNA versus mature mRNA.
    • What might cause high intergenic mapping rates? DNA contamination or poor genome assembly/annotation quality.

Reads to counts: quantification

At heart, a simple counting exercise once you have the alignments in hand.
But made more complicated by sequencing biases and multi-mapping reads.


Current best-performing tools (e.g. Salmon) do transcript-level quantification — even though this is typically followed by gene-level aggregation prior to downstream analysis.
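For a flavor of what transcript-level quantification looks like outside the pipeline, a bare-bones Salmon run could be sketched as follows (the index name and input files are placeholders):

```shell
# Build a transcriptome index once, then quantify each sample against it
salmon index -t transcripts.fa -i salmon_index
# -l A lets Salmon infer the library type automatically
salmon quant -i salmon_index -l A \
    -1 sample_R1.fq.gz -2 sample_R2.fq.gz \
    -o salmon_out/sample
# Per-transcript estimates end up in salmon_out/sample/quant.sf
```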



Fast-moving field

Several very commonly used tools like featureCounts (>15k citations) and HTSeq (~18k citations) have become disfavored in the past couple of years, as they, for example, don't count multi-mapping reads at all.

Count table analysis

The second part of RNA-seq data analysis involves analyzing the count table.
In contrast to the first part, it can be done on a laptop, and it is heavier on statistics, data visualization, and biological interpretation.


It is typically done with the R language, and common steps include:

  • Principal Component Analysis (PCA)
    Assessing overall sample clustering patterns

  • Differential Expression (DE) analysis
    Finding genes that differ in expression level between sample groups (DEGs)

  • Functional enrichment analysis
    See whether certain gene function groupings are over-represented among DEGs

Minimal Nextflow workflow example

// Pipeline parameter with a default; override with --greeting on the command line
params.greeting = 'Hello world!'
// Create a channel that emits the greeting
greeting_ch = Channel.of(params.greeting)

// Split the input string into files of 6 bytes each
process SPLITLETTERS {
    input:
    val x

    output:
    path 'chunk_*'

    script:
    """
    printf '$x' | split -b 6 - chunk_
    """
}

// Convert the contents of each chunk file to uppercase
process CONVERTTOUPPER {
    input:
    path y

    output:
    stdout

    script:
    """
    cat $y | tr '[a-z]' '[A-Z]'
    """
}

workflow {
    letters_ch = SPLITLETTERS(greeting_ch)
    // flatten() emits each chunk file separately, so chunks are processed in parallel
    results_ch = CONVERTTOUPPER(letters_ch.flatten())
    results_ch.view { it }
}